Abstract: This paper presents new lower and upper 
bounds for the compression rate of optimal binary prefix 
codes on memoryless sources according to various nonlin- 
ear codeword length objectives. Like the most well-known 
redundancy bounds for minimum (arithmetic) average 
redundancy coding — Huffman coding — these are in 
terms of a form of entropy and/or the probability of the 
most probable input symbol. The bounds here improve on 
known bounds of the form $L \in [H, H+1)$, where $H$ is 
some form of entropy in bits (or, in the case of redundancy 
measurements, 0) and L is the length objective, also in bits. 
The objectives explored here include exponential-average 
length, maximum pointwise redundancy, and exponential-
average pointwise redundancy (also called $d$th exponential 
redundancy). These relate to queueing and single-shot 
communications, Shannon coding and universal model- 
ing (worst-case minimax redundancy), and bridging the 
maximum pointwise redundancy problem with Huffman 
coding, respectively. A generalized form of Huffman coding 
known to find optimal codes for these objectives helps yield 
these bounds, some of which are tight. Related properties 
to such bounds, also explored here, are the necessary and 
sufficient conditions for the shortest codeword being a 
specific length. 

Index Terms — Huffman codes, optimal prefix code, 
queueing, Renyi entropy, Shannon codes, worst case min- 
imax redundancy. 



I. Introduction 

Since Shannon introduced information theory, we have 
had entropy bounds for the expected codeword length 
of optimal lossless fixed-to-variable-length binary codes. 
The lower bound is entropy, while the upper bound (corresponding
to a maximum average redundancy of one bit for an optimal code,
thus yielding unit-sized bounds) follows from a suboptimal code,
the Shannon code, for which the codeword length of an input of
probability $p$ is $\lceil -\log_2 p \rceil$ [1]. Huffman found a method of producing 
an optimal code by building a tree in which the two 
nodes with lowest weight (probability) are merged to 
produce a node with their combined weight [2]. On the 
occasion of the twenty-fifth anniversary of the Huffman 
algorithm, Gallager introduced bounds in terms of the 
most probable symbol which improved on the unit-sized 
redundancy bound [3]. Both upper and lower bounds 
have since been improved given this most probable 
symbol [4]-[8] such that these bounds are tight when this 
symbol's probability is at least 1/127 (and close-to-tight 
when it has lower probability). Such bounds are useful 
for quickly bounding the performance of an optimal code 
without running the algorithm that would produce the 
code. The bounds are for a fixed-sized input; asymptotic 
treatment of redundancy for block codes of growing size, 
based on binary memoryless sources, is given in [9]. 

Others have given consideration to objectives other 
than expected codeword length [10, §2.6]. Many of these 
nonlinear objectives, which have a variety of applica- 
tions, also have unit-sized bounds but have heretofore 
lacked tighter closed-form bounds achieved using the 
most probable symbol and, if necessary, some form 
of entropy. We address such problems here, finding 
upper and lower bounds for the optimal codes of given 
probability mass functions for nonlinear objectives. 

A lossless binary prefix coding problem takes a probability
mass function $p = \{p_i\}$, defined for all $i$ in the input
alphabet $\mathcal{X}$, and finds a binary code for $\mathcal{X}$. Without
loss of generality, we consider an $n$-item source emitting
symbols drawn from the alphabet $\mathcal{X} = \{1, 2, \ldots, n\}$,
where $\{p_i\}$ is the sequence of probabilities for possible
symbols ($p_i > 0$ for $i \in \mathcal{X}$ and $\sum_{i \in \mathcal{X}} p_i = 1$) in
monotonically nonincreasing order ($p_i \geq p_j$ for $i < j$).
Thus the most probable symbol has probability $p_1$. The source
symbols are coded into binary codewords. The codeword
$c_i \in \{0, 1\}^*$ in code $c$, corresponding to input symbol $i$,
has length $l_i$, defining length vector $l$.

The goal of the traditional coding problem is to find
a prefix code minimizing expected codeword length
$\sum_{i \in \mathcal{X}} p_i l_i$, or, equivalently, minimizing average redundancy

$$R(l,p) \triangleq \sum_{i \in \mathcal{X}} p_i l_i - H(p) = \sum_{i \in \mathcal{X}} p_i (l_i + \lg p_i) \tag{1}$$

where $H$ is $-\sum_{i \in \mathcal{X}} p_i \lg p_i$ (Shannon entropy) and $\lg \triangleq \log_2$.
A prefix code is a code for which no codeword
begins with a sequence that also comprises the whole of
a second codeword.

This problem is equivalent to finding a minimum- 
weight external path among all rooted binary trees, due 
to the fact that every prefix code can be represented as 
a binary tree. In this tree representation, each edge from 
a parent node to a child node is labeled 0 (left) or 1 
(right), with at most one of each type of edge per parent 
node. A leaf is a node without children; this corresponds 
to a codeword, and the codeword is determined by the 
path from the root to the leaf. Thus, for example, a leaf 
that is the right-edge (1) child of a left-edge (0) child 
of a left-edge (0) child of the root will correspond to 
codeword 001. Leaf depth (distance from the root) is thus 
codeword length. If we represent external path weight
as $\sum_{i \in \mathcal{X}} w(i) l_i$, the weights are the probabilities (i.e.,
$w(i) = p_i$), and, in fact, we refer to the problem inputs
as $\{w(i)\}$ for certain generalizations in which their sum,
$\sum_{i \in \mathcal{X}} w(i)$, need not be 1.

If formulated in terms of I, the constraints on the 
minimization are the integer constraint (i.e., that codes 
must be of integer length) and the Kraft inequality [11]; 
that is, the set of allowable codeword length vectors is 

$$\mathcal{L}_n \triangleq \left\{ l \in \mathbb{Z}_+^n \text{ such that } \sum_{i=1}^n 2^{-l_i} \leq 1 \right\}.$$
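
As a minimal sketch of these two constraints, the following Python fragment tests whether a candidate length vector is allowable. The function names `kraft_sum` and `is_allowable` are illustrative choices, not notation from the paper, and exact fractions are used only to avoid rounding questions.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Return the Kraft sum, sum_i 2^(-l_i), as an exact fraction."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def is_allowable(lengths):
    """True if all lengths are positive integers and the Kraft inequality holds."""
    integral = all(isinstance(l, int) and l > 0 for l in lengths)
    return integral and kraft_sum(lengths) <= 1

print(is_allowable([1, 2, 2]))  # lengths of the prefix code {0, 10, 11}
print(is_allowable([1, 1, 2]))  # Kraft sum exceeds 1, so no prefix code exists
```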

Because Huffman's algorithm [2] finds codes minimizing
average redundancy (1), the minimum-average 
redundancy problem itself is often referred to as the 
"Huffman problem," even though the problem did not 
originate with Huffman himself. Huffman's algorithm 
is a greedy algorithm built on the observation that the 
two least likely items will have the same length and 
can thus be considered siblings in the coding tree. A 
reduction can thus be made in which the two items of 
weights w(i) and w(j) can be considered as one with 
combined weight w(i) + w(j), and the codeword of the 
combined item determines all but the last bit of each of 
the items combined, which are differentiated by this last 
bit. This reduction continues until there is one item left, 
and, assigning this item the null string, a code is defined 
for all input items. In the corresponding optimal code
tree, the $i$th leaf corresponds to the codeword of the $i$th
input item, and thus has weight $w(i)$, whereas the weights
of parent nodes are determined by the combined weights
of the corresponding merged items.
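
The reduction described above can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (a binary heap for selecting the two lowest weights, arbitrary tie-breaking), and the names `huffman_lengths` and `average_redundancy` are introduced here only for the example.

```python
import heapq
from math import log2

def huffman_lengths(p):
    """Codeword lengths of a Huffman code for the probabilities p."""
    n = len(p)
    # Heap entries: (weight, unique tie-breaker, leaves in this subtree).
    heap = [(w, i, [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * n
    counter = n
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        for i in leaves1 + leaves2:
            lengths[i] += 1  # each leaf under the merged node moves one level deeper
        heapq.heappush(heap, (w1 + w2, counter, leaves1 + leaves2))
        counter += 1
    return lengths

def average_redundancy(lengths, p):
    """R(l, p) = sum_i p_i (l_i + lg p_i), as in (1)."""
    return sum(pi * (li + log2(pi)) for pi, li in zip(p, lengths))

p = [0.5, 0.3, 0.2]
l = huffman_lengths(p)
print(l, average_redundancy(l, p))
```

For this input the two least likely items are merged first and end up as siblings at depth 2, while the most likely item is left at depth 1.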



Shannon [1] had previously shown that an optimal $l^{\mathrm{opt}}$
must satisfy

$$H(p) \leq \sum_{i \in \mathcal{X}} p_i l_i^{\mathrm{opt}} < H(p) + 1$$

or, equivalently,

$$0 \leq R(l^{\mathrm{opt}}, p) < 1.$$

Less well known is that simple changes to the Huffman 
algorithm solve several related coding problems which 
optimize for different objectives. We discuss three such 
problems, all three of which have been previously shown 
to satisfy redundancy bounds for optimal $l$ of the form

$$H(p) \leq L(p,l) < H(p) + 1$$

or

$$0 \leq R(l,p) < 1$$

for some entropy measure H and cost measure L, or 
some redundancy measure R. 

Many authors consider generalized versions of the
Huffman algorithm [12]-[15]. These generalizations
change the combining rule; instead of replacing items
$i$ and $j$ with an item of weight $w(i) + w(j)$, the
generalized algorithm replaces them with an item of
weight $f(w(i), w(j))$ for some function $f$. Thus the
weight of a combined item (a node) no longer need be
equal to the sum of the probabilities of the items merged
to create it (the sum of the leaves of the corresponding
subtree). This has the result that the sum of weights in
a reduced problem need not be 1, unlike in the original
Huffman algorithm. In particular, the weight of the root,
$w_{\mathrm{root}}$, need not be 1. However, we continue to assume
that the sum of inputs to the coding problems will be 1
(with the exception of reductions among problems).
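
A sketch of such a generalized algorithm follows, with the combining rule passed in as a function. The name `generalized_huffman` and its return convention (lengths plus root weight) are assumptions made for illustration; the combining rules shown are the ordinary sum and the rules (2) and (6) discussed in the subsections below.

```python
import heapq

def generalized_huffman(weights, combine):
    """Repeatedly merge the two lowest-weight items using `combine`.

    Returns (lengths, root_weight). With combine = sum of weights this is
    the ordinary Huffman algorithm."""
    n = len(weights)
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * n
    counter = n  # unique tie-breaker so leaf lists are never compared
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        for i in a + b:
            lengths[i] += 1
        heapq.heappush(heap, (combine(w1, w2), counter, a + b))
        counter += 1
    return lengths, heap[0][0]

p = [0.5, 0.3, 0.2]
print(generalized_huffman(p, lambda x, y: x + y))          # root weight 1
print(generalized_huffman(p, lambda x, y: 2 * max(x, y)))  # rule (2)
print(generalized_huffman(p, lambda x, y: 2 * (x + y)))    # rule (6) with a = 2
```

Only the first call leaves the root with weight 1; for the other two rules the root weight differs from the sum of the inputs, as noted above.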

A. Maximum pointwise redundancy 

The most recent variation in problem objective we
consider is the problem proposed by Drmota and Szpankowski
[16]. Instead of minimizing average redundancy
$R(l,p) = \sum_{i \in \mathcal{X}} p_i (l_i + \lg p_i)$, here we minimize
maximum pointwise redundancy

$$R^*(l,p) \triangleq \max_{i \in \mathcal{X}} (l_i + \lg p_i).$$

Originally solved via an extension of Shannon coding,
this was later noted to be solvable via a variation of
Huffman coding [17] derived from that in [18], one using
combining rule

$$f^*(w(i), w(j)) \triangleq 2 \max(w(i), w(j)). \tag{2}$$

The solution of this worst-case pointwise redundancy 
problem is relevant to optimizing maximal (worst-case) 
minimax redundancy, a universal modeling problem for
which the set V of possible probability distributions
results in a normalized "maximum likelihood distribution" [16].
More recently, Gawrychowski and Gagie proposed 
a different worst-case redundancy problem which also 
finds its solution in minimizing maximum pointwise 
redundancy, one for which normalization is not relevant 
and one which assumes all probability distributions are 
possible (rather than just a subset V of the probability 
simplex) [19]. 

The first proposed algorithm for the maximum point- 
wise redundancy problem is closely related to Shannon 
coding. It is based on the Shannon code so that each 
codeword is the same length as or one bit shorter 
than the corresponding codeword in the Shannon code. 
This method is called "generalized Shannon coding." 
(With proper tie-breaking techniques, the Huffman-like 
solution guarantees that each codeword, in turn, is no 
longer than the generalized Shannon codeword. As both 
are optimal, this makes no difference in the maximum 
pointwise redundancy.) In Section II we see that an
optimal code for $p$ has redundancy $R^*_{\mathrm{opt}}(p) \in [0, 1)$, an
already-known bound improved upon therein. The upper
bound is easily illustrated by noticing that a Shannon
code (and thus a generalized Shannon code), unlike a
Huffman code, guarantees that a given codeword has a
length within one bit of the associated input symbol's
self-information, $-\lg p_i$.
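
The upper bound can be checked numerically with a short sketch. The helper names `shannon_lengths` and `max_pointwise_redundancy` are illustrative; the code implements only the plain Shannon code, not the generalized Shannon code of [16].

```python
from math import ceil, log2

def shannon_lengths(p):
    """Shannon code lengths: ceil(-lg p_i) for each symbol."""
    return [ceil(-log2(pi)) for pi in p]

def max_pointwise_redundancy(lengths, p):
    """R*(l, p) = max_i (l_i + lg p_i)."""
    return max(li + log2(pi) for li, pi in zip(lengths, p))

p = [0.5, 0.3, 0.2]
l = shannon_lengths(p)
print(l, max_pointwise_redundancy(l, p))  # every term l_i + lg p_i lies in [0, 1)
```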

B. $d$th exponential redundancy

A spectrum of problems bridges the objective of
Huffman coding with the objective optimized by generalized
Shannon coding (and its Huffman-like alternative)
using an objective proposed in [20] and solved for in
[13]. In this particular context, the range of problems,
parameterized by a variable $d$, can be called $d$th exponential
redundancy [17]. This is the minimization of the
following:

$$R^d(l,p) \triangleq \frac{1}{d} \lg \sum_{i \in \mathcal{X}} p_i^{1+d} 2^{d l_i} = \frac{1}{d} \lg \sum_{i \in \mathcal{X}} p_i 2^{d(l_i + \lg p_i)}. \tag{3}$$

Although $d > 0$ is the case we consider most often here,
$d \in (-1, 0)$ is also a valid problem. If we let $d \to 0$, we
approach the average redundancy (Huffman's objective),
while $d \to \infty$ is maximum pointwise redundancy (with
its Shannon-like solution) [17]. The combining rule,
introduced in [13, p. 486], is

$$f_d(w(i), w(j)) \triangleq \left( 2^d w(i)^{1+d} + 2^d w(j)^{1+d} \right)^{\frac{1}{1+d}}. \tag{4}$$

As we show at the end of Section II, the upper
bound for maximum pointwise redundancy also improves
upon the already-known bound for this problem,
$R^d_{\mathrm{opt}}(p) \triangleq \min_{l \in \mathcal{L}_n} R^d(l,p) \in [0, 1)$, as
maximum pointwise redundancy is an upper bound for
$d$th exponential redundancy. A lower bound is obtained
by observing the reverse relationship with the average
redundancy problem.
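
The ordering between the three redundancy measures can be illustrated numerically. The sketch below evaluates (3) directly for a fixed code; the function names are hypothetical, and the values of $d$ are arbitrary choices meant only to show the two limiting cases.

```python
from math import log2

def average_redundancy(l, p):
    return sum(pi * (li + log2(pi)) for pi, li in zip(p, l))

def max_pointwise_redundancy(l, p):
    return max(li + log2(pi) for pi, li in zip(p, l))

def dth_exponential_redundancy(l, p, d):
    """R^d(l, p) = (1/d) lg sum_i p_i 2^(d (l_i + lg p_i)), as in (3)."""
    return log2(sum(pi * 2 ** (d * (li + log2(pi))) for pi, li in zip(p, l))) / d

p, l = [0.5, 0.3, 0.2], [1, 2, 2]
print(average_redundancy(l, p))                  # limit as d -> 0
for d in (1e-6, 0.5, 1.0, 10.0, 200.0):
    print(d, dth_exponential_redundancy(l, p, d))
print(max_pointwise_redundancy(l, p))            # limit as d -> infinity
```

For $d > 0$ the printed values increase with $d$ and stay between the average and the maximum pointwise redundancy.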

C. Exponential average 

A related problem is one proposed by Campbell
[21], [22]. This exponential problem, given probability
mass function $p$ and $a \in (0, \infty) \setminus \{1\}$, is to find a code
minimizing

$$L_a(p,l) \triangleq \log_a \sum_{i \in \mathcal{X}} p_i a^{l_i}. \tag{5}$$

The solution to this [12], [13], [23] uses combining rule

$$f_a(w(i), w(j)) \triangleq a w(i) + a w(j). \tag{6}$$

A change of variables transforms the $d$th exponential
redundancy problem into (5) by assigning $a = 2^d$ and
using input weights $w(i)$ proportional to $p_i^{1+d}$, which
yields (6). We illustrate this precisely at the end of
Section II in (15), which we use in Section III to find
initial improved entropy bounds. These are supplemented
by additional bounds for problems with $a \in (0.5, 1)$ and
$p_1 > 2a/(2a + 3)$ at the end of Section III.
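
The change of variables can be checked numerically. The sketch below assumes $a = 2^d$ and $\alpha = 1/(1+d)$, and uses a hypothetical helper `tilted` for the normalized weights proportional to $p_i^{\alpha}$; it compares the $d$th exponential redundancy of the tilted distribution with the exponential-average cost minus the Renyi entropy of order $\alpha$, and checks that rule (4) becomes rule (6) on weights raised to the power $1+d$.

```python
from math import log, log2

def dth_exponential_redundancy(l, p, d):
    return log2(sum(pi ** (1 + d) * 2 ** (d * li) for pi, li in zip(p, l))) / d

def exponential_average(p, l, a):
    """L_a(p, l) = log_a sum_i p_i a^(l_i), as in (5)."""
    return log(sum(pi * a ** li for pi, li in zip(p, l)), a)

def renyi_entropy(p, alpha):
    return log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

def tilted(p, alpha):
    """Normalized distribution proportional to p_i^alpha (illustrative helper)."""
    z = sum(pi ** alpha for pi in p)
    return [pi ** alpha / z for pi in p]

def f_d(x, y, d):
    return (2 ** d * x ** (1 + d) + 2 ** d * y ** (1 + d)) ** (1 / (1 + d))

p, l = [0.5, 0.3, 0.2], [1, 2, 2]
for a in (0.7, 2.0, 3.0):
    d = log2(a)
    alpha = 1 / (1 + d)
    lhs = dth_exponential_redundancy(l, tilted(p, alpha), d)
    rhs = exponential_average(p, l, a) - renyi_entropy(p, alpha)
    print(a, lhs, rhs)

# Rule (4) on weights x, y equals rule (6) on x^(1+d), y^(1+d) when a = 2^d.
d = 1.5
print(f_d(0.3, 0.2, d) ** (1 + d), 2 ** d * (0.3 ** (1 + d) + 0.2 ** (1 + d)))
```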

It is important to note here that $a > 1$ is an average
of growing exponentials, while $a < 1$ is an average
of decaying exponentials. Because of this, these two
subproblems have different properties and have often
been considered separately in the literature. The $a > 1$
variation of (5) was used in Humblet's dissertation
[24] for a queueing application originally proposed by
Jelinek [25] and expounded upon in [26]. This problem is
one in which overflow probability should be minimized,
where the source produces symbols randomly and the
codewords are temporarily stored in a finite buffer.
The Huffman-like coding method was simultaneously
published in [12], [13], [23]; in the last of these, Humblet
noted that the Huffman combining method (6) finds the
optimal code with $a \in (0, 1)$ as well. An application
for this decaying exponential variant involving single-shot
communications has a communication channel with
a window of opportunity of a total duration (in bits)
distributed geometrically with parameter $a$ [27]. The
probability of successful transmission is

$$P[\text{success}] = a^{L_a(p,l)}. \tag{7}$$

For $a > 0.5$, the unit-sized bound we improve upon is in
terms of Renyi entropy, as in (13); the solution is trivial
for $a \leq 0.5$, as we note at the start of Section III.
Note that $a \to 1$ corresponds to the usual linear
expectation objective. Problems for $a$ near 1 are of
special interest, since $a \downarrow 1$ corresponds to the
minimum-variance solution if the problem has multiple solutions
(as noted in [13], among others), while $a \uparrow 1$
corresponds to the maximum-variance solution.

The aforementioned improved bounds are in all cases
based on a given highest symbol probability, $p_1$. We
also discuss the related issue of the length of the most
likely codeword in these coding problems. These bounds
are the first of their kind for nontraditional Huffman
codes, bounds which are functions of both entropy (if
applicable) and $p_1$, as in the traditional case. However,
they are not the first improved bounds for such codes.
More sophisticated bounds on the optimal solution for
the exponential-average objective are given in [26] for
$a > 1$; these appear as solutions to related problems
rather than in closed form, however, and these problems
require no less time or space to solve than the original
problem. They are mainly useful for analysis. Bounds
given elsewhere for a closely related objective having
a one-to-one correspondence with (5) are demonstrated
under the assumption that $p_1 > 0.4$ always implies $l_1$
can be 1 for the optimal code [28]. We show that this
is not necessarily the case due to the difference between
the exponential-average objective and the usual objective
of an arithmetic average.

In the next section, we find tight exhaustive bounds
for the values of optimal $R^*(l,p)$ and corresponding
$l_1$ in terms of $p_1$, then find how we can extend these
to exhaustive (but not tight) bounds for optimal
$R^d(l,p)$. In Section III, as previously noted, we investigate
the behavior of optimal $L_a(p,l)$ and $l_1$ in terms
of $p_1$. The main results are given as theorems, corollaries,
and remarks immediately following them, and many are
also illustrated as figures.

II. Bounds on Redundancy Problems 
A. Maximum pointwise redundancy bounds 

Shannon found redundancy bounds for $R_{\mathrm{opt}}(p)$, the
average redundancy $R(l,p) = \sum_{i \in \mathcal{X}} p_i l_i - H(p)$ of the
average redundancy-optimal $l$. The simplest bounds for
minimized maximum pointwise redundancy

$$R^*_{\mathrm{opt}}(p) \triangleq \min_{l \in \mathcal{L}_n} \max_{i \in \mathcal{X}} (l_i + \lg p_i)$$

are quite similar to and can be combined with Shannon's
bounds as follows:

$$0 \leq R_{\mathrm{opt}}(p) \leq R^*_{\mathrm{opt}}(p) < 1 \tag{8}$$

The average redundancy case is a lower bound because
the maximum ($R^*(l,p)$) of the values ($l_i + \lg p_i$)
that average to a quantity ($R(l,p)$) can be no less
than the average (a fact that holds for all $l$ and $p$).
The upper bound is due to the Shannon code
$l^0_i(p) \triangleq \lceil -\lg p_i \rceil$, resulting in
$R^*_{\mathrm{opt}}(p) \leq R^*(l^0(p), p) = \max_{i \in \mathcal{X}} (\lceil -\lg p_i \rceil + \lg p_i) < 1$.

A few observations can be used to find a series of improved
lower and upper bounds on optimum maximum
pointwise redundancy based on (8):

Lemma 1: Suppose we apply (2) to find a Huffman-like
code tree in order to minimize maximum pointwise
redundancy. Then the following holds:

1) Items are always merged by nondecreasing weight.

2) The weight of the root $w_{\mathrm{root}}$ of the coding tree
determines the maximum pointwise redundancy,
$R^*(l,p) = \lg w_{\mathrm{root}}$.

3) The total probability of any subtree is no greater
than the total weight of the subtree.

4) If $p_1 \leq 2 p_{n-1}$, then a minimum maximum pointwise
redundancy code can be represented by a
complete tree, that is, a tree with leaves at depth
$\lfloor \lg n \rfloor$ and $\lceil \lg n \rceil$ only (with $\sum_{i \in \mathcal{X}} 2^{-l_i} = 1$).
(This is similar to the property noted in [29] for
optimal-expected-length codes of sources termed
quasi-uniform in [30].)

Proof: We use an inductive proof in which base 
cases of sizes 1 and 2 are trivial, and we use weight 
function w instead of probability mass function p to 
emphasize that the sums of weights need not necessarily 
add up to 1. Assume first that all properties here are true
for trees of size $n - 1$ and smaller. We wish to show that
they are true for trees of size $n$.

The first property is true because $f^*(w(i), w(j)) =
2 \max(w(i), w(j)) > w(i)$ for any $i$ and $j$; that is, a
compound item always has greater weight than either of
the items combined to form it. Thus, after the first two
weights are combined, all remaining weights, including
the compound weight, are no less than either of the two
original weights.

Consider the second property. After merging the two
least weighted of $n$ (possibly merged) items, the property
holds for the resulting $n - 1$ items. For the $n - 2$ untouched
items, $l_i + \lg w(i)$ remains the same. For the two
merged items, let $l_{n-1}$ and $w(n-1)$ denote the maximum
depth/weight pair for item $n - 1$ and $l_n$ and $w(n)$ the pair
for $n$. If $l'$ and $w'$ denote the depth/weight pair of the
combined item, then $l' + \lg w' = l_n - 1 + \lg(2 \max(w(n-1), w(n)))
= \max(l_{n-1} + \lg w(n-1), l_n + \lg w(n))$, so
the two trees have identical maximum redundancy, which
is equal to $\lg w_{\mathrm{root}}$ since the root node is of depth 0.
Consider, for example, $p = (0.5, 0.3, 0.2)$, which has
optimal codewords with lengths $l = (1, 2, 2)$. The first
combined pair has $l' + \lg w' = 1 + \lg 0.6 = \max(2 + \lg 0.3, 2 + \lg 0.2)
= \max(l_2 + \lg p_2, l_3 + \lg p_3)$. This
value is identical to that of the maximum redundancy,
$\lg 1.2 = \lg w_{\mathrm{root}}$.

For the third property, the first combined pair yields 
a weight that is no less than the combined probabilities. 
Thus, via induction, the total probability of any (sub)tree 
is no greater than the weight of the (sub)tree. 

In order to show the final property, first note that
$\sum_{i \in \mathcal{X}} 2^{-l_i} = 1$ for any tree created using the Huffman-like
procedure, since all internal nodes have two children.
Now think of the procedure as starting with a queue of 
input items, ordered by nondecreasing weight from head 
to tail. After merging two items, obtained from the head 
of the queue, into one compound item, that item is placed 
back into the queue as one item, but not necessarily at 
the tail; an item is placed such that its weight is no 
smaller than any item ahead of it and is smaller than 
any item behind it. In keeping items ordered, this results 
in an optimal coding tree. A variant of this method can 
be used for linear-time coding [17]. 

In this case, we show not only that an optimal complete
tree exists, but that, given an $n$-item tree, all items
that finish at level $\lceil \lg n \rceil$ appear closer to the head of the
queue than any item at level $\lceil \lg n \rceil - 1$ (if any), using
a similar approach to the proof of Lemma 2 in [27].
Suppose this is true for every case with $n - 1$ items for
$n > 2$, that is, that all nodes are at levels $\lfloor \lg(n-1) \rfloor$ or
$\lceil \lg(n-1) \rceil$, with the latter items closer to the head of
the queue than the former. Consider now a case with $n$
nodes. The first step of coding is to merge two nodes,
resulting in a combined item that is placed at the end
of the combined-item queue, as we have asserted that
$p_1 \leq 2 p_{n-1} = 2 \max(p_{n-1}, p_n)$. Because it is at the end
of the queue in the $n - 1$ case, this combined node is at
level $\lfloor \lg(n-1) \rfloor$ in the final tree, and its children are at
level $1 + \lfloor \lg(n-1) \rfloor = \lceil \lg n \rceil$. If $n$ is a power of two,
the remaining items end up on level $\lg n = \lceil \lg(n-1) \rceil$,
satisfying this lemma. If $n - 1$ is a power of two, they
end up on level $\lg(n-1) = \lfloor \lg n \rfloor$, also satisfying the
lemma. Otherwise, there is at least one item ending up
at level $\lceil \lg n \rceil = \lceil \lg(n-1) \rceil$ near the head of the queue,
followed by the remaining items, which end up at level
$\lfloor \lg n \rfloor = \lfloor \lg(n-1) \rfloor$. In any case, all properties of the
lemma are satisfied for $n$ items, and thus for any number
of items. ■
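
The second property of Lemma 1 lends itself to a small numerical check. The sketch below, written under the assumption that exhaustive search over short length vectors is acceptable for tiny alphabets, compares $\lg w_{\mathrm{root}}$ from the Huffman-like procedure with a brute-force minimum of the maximum pointwise redundancy. The function names are illustrative.

```python
import heapq
from itertools import product
from math import log2

def minimax_root_weight(p):
    """Root weight after merging with rule (2): 2 * max(w(i), w(j))."""
    heap = list(p)
    heapq.heapify(heap)
    while len(heap) > 1:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        heapq.heappush(heap, 2 * max(x, y))
    return heap[0]

def brute_force_min_max_redundancy(p):
    """Minimum of max_i (l_i + lg p_i) over lengths satisfying Kraft."""
    n = len(p)
    best = float("inf")
    for l in product(range(1, n + 1), repeat=n):
        if sum(2.0 ** -li for li in l) <= 1:
            best = min(best, max(li + log2(pi) for li, pi in zip(l, p)))
    return best

for p in ([0.5, 0.3, 0.2], [0.4, 0.3, 0.2, 0.1], [0.3, 0.25, 0.2, 0.15, 0.1]):
    print(p, log2(minimax_root_weight(p)), brute_force_min_max_redundancy(p))
```

For $p = (0.5, 0.3, 0.2)$ both quantities equal $\lg 1.2$, matching the worked example in the proof.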

We can now present the improved redundancy bounds. 

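Theorem 1: For any distribution in which $p_1 \geq 2/3$,
$R^*_{\mathrm{opt}}(p) = 1 + \lg p_1$. If $p_1 \in [0.5, 2/3)$, then
$R^*_{\mathrm{opt}}(p) \in [1 + \lg p_1, 2 + \lg(1 - p_1))$ and these bounds are tight.
Define $\lambda \triangleq \lceil -\lg p_1 \rceil$. Thus $\lambda$ satisfies
$p_1 \in [2^{-\lambda}, 2^{-\lambda+1})$, and $\lambda > 1$ for $p_1 \in (0, 0.5)$;
in this range, the following bounds for $R^*_{\mathrm{opt}}(p)$ are tight:

$$\begin{array}{ll}
p_1 & R^*_{\mathrm{opt}}(p) \\ \hline
\left[ \dfrac{1}{2^\lambda}, \dfrac{1}{2^\lambda - 1} \right) &
\left[ \lambda + \lg p_1,\; 1 + \lg \dfrac{1 - p_1}{1 - 2^{-\lambda}} \right) \\[2ex]
\left[ \dfrac{1}{2^\lambda - 1}, \dfrac{2}{2^\lambda + 1} \right) &
\left[ \lg \dfrac{1 - p_1}{1 - 2^{-\lambda+1}},\; 1 + \lg \dfrac{1 - p_1}{1 - 2^{-\lambda}} \right) \\[2ex]
\left[ \dfrac{2}{2^\lambda + 1}, \dfrac{1}{2^{\lambda-1}} \right) &
\left[ \lg \dfrac{1 - p_1}{1 - 2^{-\lambda+1}},\; \lambda + \lg p_1 \right]
\end{array}$$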

Proof: The key here is generalizing the unit-sized
bounds of (8).

1) Upper bound: Before we prove the upper bound,
note that, once proven, the tightness of the upper bound
in $[0.5, 1)$ is shown via

$$p = (p_1, 1 - p_1 - \epsilon, \epsilon)$$

for which the bound is achieved in $[2/3, 1)$ for any
$\epsilon \in (0, (1 - p_1)/2]$ and approached in $[0.5, 2/3)$ as $\epsilon \downarrow 0$.

Let us define what we call a first-order Shannon code:

$$l^1_i(p) \triangleq \begin{cases}
\lambda, & i = 1 \\
\left\lceil -\lg \left( p_i \dfrac{1 - 2^{-\lambda}}{1 - p_1} \right) \right\rceil, & i \in \{2, 3, \ldots, n\}
\end{cases}$$

This code, previously presented in the context of finding
average redundancy bounds given any probability [31],
improves upon the original "zero-order" Shannon code
$l^0(p)$ by taking the length of the first codeword into
account when designing the rest of the code. The code
satisfies the Kraft inequality, and thus, as a valid code,
its redundancy is an upper bound on the redundancy of
an optimal code. Note that

$$\max_{i > 1} \left( l^1_i(p) + \lg p_i \right)
= \max_{i > 1} \left( \left\lceil \lg \frac{1 - p_1}{p_i (1 - 2^{-\lambda})} \right\rceil + \lg p_i \right)
< 1 + \lg \frac{1 - p_1}{1 - 2^{-\lambda}}. \tag{9}$$

There are two cases:

a) $p_1 \in [2/(2^\lambda + 1), 1/2^{\lambda-1})$: In this case, the
maximum pointwise redundancy of the first item is
no less than $1 + \lg((1 - p_1)/(1 - 2^{-\lambda}))$, and thus
$R^*_{\mathrm{opt}}(p) \leq R^*(l^1(p), p) = \lambda + \lg p_1$. If $\lambda > 1$ and
$p_1 \in [2/(2^\lambda + 1), 1/2^{\lambda-1})$, consider probability mass
function

$$p = \left( p_1, \underbrace{\frac{1 - p_1 - \epsilon}{2^\lambda - 2}, \ldots, \frac{1 - p_1 - \epsilon}{2^\lambda - 2}}_{2^\lambda - 2}, \epsilon \right)$$

where $\epsilon \in (0, 1 - p_1 2^{\lambda-1})$. Because $p_1 \geq 2/(2^\lambda + 1)$,
$1 - p_1 2^{\lambda-1} \leq (1 - p_1 - \epsilon)/(2^\lambda - 2)$, and $p_{n-1} \geq p_n$.
Similarly, $p_1 < 1/2^{\lambda-1}$ assures that $p_1 \geq p_2$, so the
probability mass function is monotonic. Since $2 p_{n-1} > p_1$,
by Lemma 1 an optimal code for this probability
mass function is $l_i = \lambda$ for all $i$, achieving $R^*(l,p) =
\lambda + \lg p_1$, with item 1 having the maximum pointwise
redundancy.

b) $p_1 \in [1/2^\lambda, 2/(2^\lambda + 1))$: In this case, (9)
immediately results in $R^*_{\mathrm{opt}}(p) \leq R^*(l^1(p), p) <
1 + \lg((1 - p_1)/(1 - 2^{-\lambda}))$. The probability mass function

$$p = \left( p_1, \underbrace{\frac{1 - p_1 - \epsilon}{2^\lambda - 1}, \ldots, \frac{1 - p_1 - \epsilon}{2^\lambda - 1}}_{2^\lambda - 1}, \epsilon \right)$$

illustrates the tightness of this bound for $\epsilon \downarrow 0$. This is
a monotonic probability mass function for sufficiently
small $\epsilon$, for which we also have $p_1 < 2 p_{n-1}$, so (again
from Lemma 1) this results in an optimal code with $l_i =
\lambda$ for $i \in \{1, 2, \ldots, n - 2\}$ and $l_{n-1} = l_n = \lambda + 1$, and
thus the bound is approached with item $n - 1$ having the
maximum pointwise redundancy.

2) Lower bound: Consider all optimal codes with
$l_1 = \mu$ for some fixed $\mu \in \{1, 2, \ldots\}$. If $p_1 \geq 2^{-\mu}$,
$R^*(l,p) \geq l_1 + \lg p_1 = \mu + \lg p_1$. If $p_1 < 2^{-\mu}$, consider
the weights at level $\mu$ (i.e., $\mu$ edges below the root).
One of these weights is $p_1$, while the rest are known
to sum to a number no less than $1 - p_1$. Thus at least
one weight must be at least $(1 - p_1)/(2^\mu - 1)$ and
$R^*(l,p) \geq \mu + \lg((1 - p_1)/(2^\mu - 1))$. Thus,

$$R^*_{\mathrm{opt}}(p) \geq \mu + \lg \max \left( p_1, \frac{1 - p_1}{2^\mu - 1} \right)$$

for $l_1 = \mu$, and, since $\mu$ can be any positive integer,

$$R^*_{\mathrm{opt}}(p) \geq \min_{\mu \in \{1, 2, 3, \ldots\}} \left( \mu + \lg \max \left( p_1, \frac{1 - p_1}{2^\mu - 1} \right) \right)$$

which is equivalent to the bounds provided.

For $p_1 \in [1/(2^{\mu+1} - 1), 1/2^\mu)$ for some $\mu$, consider

$$\left( p_1, \underbrace{\frac{1 - p_1}{2^{\mu+1} - 2}, \ldots, \frac{1 - p_1}{2^{\mu+1} - 2}}_{2^{\mu+1} - 2} \right).$$

By Lemma 1, this has a complete coding tree (in this
case with $l_1$ one bit shorter than the other lengths) and
thus achieves the lower bound for this range ($\lambda = \mu + 1$).
Similarly,

$$\left( p_1, \underbrace{\frac{1 - p_1}{2^\mu - 1}, \ldots, \frac{1 - p_1}{2^\mu - 1}}_{2^\mu - 1} \right)$$

has a fixed-length optimal coding tree for $p_1 \in
[1/2^\mu, 1/(2^\mu - 1))$, achieving the lower bound for this
range ($\lambda = \mu$). ■




Fig. 1. Tight bounds on minimum maximum pointwise redundancy,
including achievable upper bounds (solid), approachable upper
bounds (dashed), achievable lower bounds (dotted), and fully
determined values for $p_1 \geq 2/3$ (dot-dashed).



Note that the unit-sized bounds of (8) are identical
to the tight bounds at (negative integer) powers of two.
In addition, the tight bounds clearly approach 0 and
1 as $p_1 \downarrow 0$. This behavior is in stark contrast with
average redundancy, for which bounds get closer, not
further apart, illustrated by Gallager's redundancy bound
($R_{\mathrm{opt}}(p) \leq p_1 + 0.086$ [3]) which cannot be significantly
improved for small $p_1$ [8]. Moreover, as $p_1$ approaches 1,
the upper and lower bounds on minimum average redundancy
coding converge but never merge, whereas the
minimum maximum redundancy bounds are identical for
$p_1 \geq 2/3$.
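
The bounds of Theorem 1 are simple enough to evaluate directly. The following sketch encodes the three ranges of the theorem in a hypothetical function `theorem1_bounds` and compares them against the optimal value obtained from the Huffman-like procedure with combining rule (2); the random test distributions are an arbitrary choice for illustration.

```python
import heapq
import random
from math import ceil, log2

def theorem1_bounds(p1):
    """(lower, upper) bounds on optimal maximum pointwise redundancy given p1."""
    if p1 >= 2 / 3:
        value = 1 + log2(p1)  # fully determined in this range
        return value, value
    if p1 >= 0.5:
        return 1 + log2(p1), 2 + log2(1 - p1)
    lam = ceil(-log2(p1))  # p1 in [2^-lam, 2^(1-lam)), with lam >= 2
    fixed = lam + log2(p1)
    low_alt = log2((1 - p1) / (1 - 2.0 ** (1 - lam)))
    high_alt = 1 + log2((1 - p1) / (1 - 2.0 ** (-lam)))
    lower = fixed if p1 < 1 / (2 ** lam - 1) else low_alt
    upper = high_alt if p1 < 2 / (2 ** lam + 1) else fixed
    return lower, upper

def optimal_max_redundancy(p):
    """lg of the root weight under combining rule 2 * max(w(i), w(j))."""
    heap = list(p)
    heapq.heapify(heap)
    while len(heap) > 1:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        heapq.heappush(heap, 2 * max(x, y))
    return log2(heap[0])

random.seed(0)
for _ in range(5):
    raw = [random.random() + 1e-3 for _ in range(random.randint(2, 8))]
    p = [x / sum(raw) for x in raw]
    lo, hi = theorem1_bounds(max(p))
    print(round(max(p), 3), round(lo, 3), round(optimal_max_redundancy(p), 3), round(hi, 3))
```

In each printed row the optimal redundancy falls between the lower and upper bound for the given $p_1$, without the bound computation ever building a code.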



B. Minimized maximum pointwise redundancy codeword 
lengths 

In addition to finding redundancy bounds in terms of
$p_1$, it is also often useful to find bounds on the behavior
of $l_1$ in terms of $p_1$, as was done for optimal average
redundancy in [32].

Theorem 2: Any optimal code for probability mass
function $p$, where $p_1 \geq 2^{-\nu}$, must have $l_1 \leq \nu$. This
bound is tight, in the sense that, for $p_1 < 2^{-\nu}$, one can
always find a probability mass function with $l_1 > \nu$.
Conversely, if $p_1 \leq 1/(2^\nu - 1)$, there is an optimal code
with $l_1 \geq \nu$, and this bound is also tight.

Proof: Suppose $p_1 \geq 2^{-\nu}$ and $l_1 \geq 1 + \nu$. Then
$R^*_{\mathrm{opt}}(p) = R^*(l,p) \geq l_1 + \lg p_1 \geq 1$, contradicting the
unit-sized bounds of (8). Thus $l_1 \leq \nu$.
For tightness of the bound, suppose $p_1 \in (2^{-\nu-1}, 2^{-\nu})$
and consider $n = 2^{\nu+1}$ and

$$p = \left( p_1, \underbrace{2^{-\nu-1}, \ldots, 2^{-\nu-1}}_{n-2}, 2^{-\nu} - p_1 \right).$$

If $l_1 \leq \nu$, then, by the Kraft inequality, one of $l_2$
through $l_{n-1}$ must exceed $\nu + 1$. However, this contradicts
the unit-sized bounds of (8). For $p_1 = 2^{-\nu-1}$, a uniform
distribution results in $l_1 = \nu + 1$. Thus, since these two
results hold for any $\nu$, this extends to all $p_1 < 2^{-\nu}$,
and this bound is tight.

Suppose $p_1 \leq 1/(2^\nu - 1)$ and consider an optimal
length distribution with $l_1 < \nu$. Consider the weights of
the nodes of the corresponding code tree at level $l_1$. One
of these weights is $p_1$, while the rest are known to sum
to a number no less than $1 - p_1$. Thus there is one node
of at least weight

$$\frac{1 - p_1}{2^{l_1} - 1} \geq \frac{1 - p_1}{2^{l_1} - 2^{l_1 + 1 - \nu}}$$

and thus, taking the logarithm and adding $l_1$ to the
right-hand side,

$$R^*(l,p) \geq \nu - 1 + \lg \frac{1 - p_1}{2^{\nu-1} - 1}.$$

Note that $l_1 + 1 + \lg p_1 \leq \nu + \lg p_1 \leq \nu - 1 +
\lg((1 - p_1)/(2^{\nu-1} - 1))$, a direct consequence of $p_1 \leq 1/(2^\nu - 1)$.
Thus, if we replace this code with one for which $l_1 = \nu$,
the code is still optimal. The tightness of the bound is
easily seen by applying Lemma 1 to distributions of the
form

$$\left( p_1, \underbrace{\frac{1 - p_1}{2^\nu - 2}, \ldots, \frac{1 - p_1}{2^\nu - 2}}_{2^\nu - 2} \right)$$

for $p_1 \in (1/(2^\nu - 1), 1/2^{\nu-1})$. This results in $l_1 = \nu - 1$
and thus $R^*_{\mathrm{opt}}(p) = \nu + \lg(1 - p_1) - \lg(2^\nu - 2)$, which
no code with $l_1 > \nu - 1$ could achieve. ■

In particular, if $p_1 \geq 0.5$, $l_1 = 1$, while if $p_1 \leq 1/3$,
there is an optimal code with $l_1 > 1$.



Fig. 2. Bounds on $d$th exponential redundancy, valid for any $d > 0$.
Upper bounds dashed, lower bounds dotted. (Horizontal axis: $p_1$.)



C. $d$th exponential redundancy bounds

We now briefly address the $d$th exponential redundancy
problem. Recall that this is the minimization of (3),

$$R^d(p,l) \triangleq \frac{1}{d} \lg \sum_{i \in \mathcal{X}} p_i 2^{d(l_i + \lg p_i)}.$$

A straightforward application of Lyapunov's inequality
for moments yields $R^{d'}(p,l) \leq R^d(p,l)$ for $d' \leq d$,
which, taking limits to 0 and $\infty$, results in

$$0 \leq R(p,l) \leq R^d(p,l) \leq R^*(p,l) < 1$$

for any valid $l$, $p$, and $d > 0$, resulting in an extension
of (8),

$$0 \leq R_{\mathrm{opt}}(p) \leq R^d_{\mathrm{opt}}(p) \leq R^*_{\mathrm{opt}}(p) < 1$$

where $R^d_{\mathrm{opt}}(p)$ is the optimal $d$th exponential redundancy,
an improvement on the bounds found in [17]. This leads
directly to:

Corollary 1: The upper bounds of Theorem 1 are
upper bounds for $R^d_{\mathrm{opt}}(p)$ with any $d$, while the lower
bounds of average redundancy (Huffman) coding [6] are
lower bounds for $R^d_{\mathrm{opt}}(p)$ with $d > 0$. These lower
bounds are

$$R^d_{\mathrm{opt}}(p) \geq \xi - (1 - p_1) \lg(2^\xi - 1) - H(p_1) \tag{10}$$

where

$$\xi = \left\lceil \lg \frac{1 - 2^{-\frac{1}{1 - p_1}}}{1 - 2^{-\frac{p_1}{1 - p_1}}} \right\rceil$$

for $p_1 \in (0, 1)$ and

$$H(x) \triangleq -x \lg x - (1 - x) \lg(1 - x). \tag{11}$$

This result is illustrated in Fig. 2, showing an improvement
on the original unit bounds for values of $p_1$ other
than (negative integer) powers of two.

III. Bounds on Exponential-Average 
Problems 

A. Previously known exponential-average bounds 

While the average, maximum, and $d$th average redundancy
problems yield performance bounds in terms
of $p_1$ alone, here we simply seek to find any bounds
on $L_a(p,l)$ in terms of $p_1$ and an appropriate entropy
measure (introduced below).

Note that $a \leq 0.5$ is a trivial case, always solved by
a finite unary code,

$$c^u(n) \triangleq (0, 10, 110, \ldots, 1^{n-2}0, 1^{n-1}).$$

This can be seen by applying the exponential version
of the Huffman algorithm; at each step, the combined
weight will be the lowest weight of the reduced problem,
being strictly less than the higher of the two combined
weights, thus leading to a unary code.

For $a > 0.5$, there is a relationship between this problem
and Renyi entropy. Renyi entropy [33] is defined
as

$$H_\alpha(p) \triangleq \frac{1}{1 - \alpha} \lg \sum_{i=1}^n p_i^\alpha \tag{12}$$

for $\alpha > 0$, $\alpha \neq 1$. It is often defined for $\alpha \in \{0, 1, \infty\}$
via limits, that is,

$$H_0(p) \triangleq \lim_{\alpha \downarrow 0} H_\alpha(p) = \lg \|p\|$$

(the logarithm of the cardinality of $p$),

$$H_1(p) \triangleq \lim_{\alpha \to 1} H_\alpha(p) = -\sum_{i=1}^n p_i \lg p_i$$

(the Shannon entropy of $p$), and

$$H_\infty(p) \triangleq \lim_{\alpha \uparrow \infty} H_\alpha(p) = -\lg p_1$$

(the min-entropy).
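
The definition (12) and its three limiting cases can be checked with a short sketch. The function name `renyi_entropy` and the specific orders used to approximate the limits are illustrative assumptions.

```python
from math import log2

def renyi_entropy(p, alpha):
    """H_alpha(p) = lg(sum_i p_i^alpha) / (1 - alpha), for alpha > 0, alpha != 1."""
    return log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

p = [0.5, 0.25, 0.25]
shannon = -sum(pi * log2(pi) for pi in p)

print(renyi_entropy(p, 1e-9), log2(len(p)))   # alpha near 0: lg of the cardinality
print(renyi_entropy(p, 1 + 1e-6), shannon)    # alpha near 1: Shannon entropy
print(renyi_entropy(p, 200.0), -log2(max(p))) # large alpha: approaches min-entropy
```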

Campbell first proposed exponential utility functions
for coding in [21], [22]. He observed the simple lower
bound for $a > 0.5$ in [22]; the simple upper bound has
subsequently been shown, e.g., in [34, p. 156] and [26]. These
bounds are similar to the minimum average redundancy
bounds. In this case, however, the bounds involve Renyi's
entropy, not Shannon's.

Defining

$$\alpha(a) \triangleq \frac{1}{\lg 2a} = \frac{1}{1 + \lg a}$$

and

$$L_a^{\mathrm{opt}}(p) \triangleq \min_{l \in \mathcal{L}_n} L_a(p,l),$$

the unit-sized bounds for $a > 0.5$, $a \neq 1$ are

$$0 \leq L_a^{\mathrm{opt}}(p) - H_{\alpha(a)}(p) < 1. \tag{13}$$

In the next subsection we show how this bound follows
from a result introduced there.

As an example of these bounds, consider the probability distribution implied by Benford's law [35], [36]:

$$p_i = \log_{10}(i + 1) - \log_{10}(i), \quad i = 1, 2, \ldots, 9 \tag{14}$$

that is,

$$p \approx (0.30, 0.17, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05).$$

At $a = 0.6$, for example, $H_{\alpha(a)}(p) = 2.259\ldots$, so the optimal code cost is between 2.259 and 3.260. In the application given in [27] with (7), this corresponds to an optimal solution with probability of success (codeword transmission) between 0.189 and 0.316. Running the algorithm, the optimal lengths are $l = (1, 2, 3, 4, 5, 6, 7, 8, 8)$, resulting in cost $2.382\ldots$ (probability of success $0.296\ldots$). At $a = 2$, $H_{\alpha(a)}(p) = 3.026\ldots$, so the optimal code cost is between 3.026 and 4.027, while the algorithm yields an optimal code with $l = (2, 3, 3, 3, 3, 4, 4, 4, 4)$, resulting in cost $3.099\ldots$.
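The Benford example can be reproduced with a short script. This is a sketch under stated assumptions: `exp_huffman_lengths` is an illustrative heap-based implementation of the exponential Huffman merge rule $w = a(w' + w'')$, and the helper names `alpha_of`, `renyi_entropy`, and `exp_average_length` are hypothetical; only the formulas for $\alpha(a)$, $H_\alpha$, and $L_a$ come from the text.

```python
import heapq
import math

def exp_huffman_lengths(p, a):
    """Codeword lengths from the merge rule w = a * (w' + w'')."""
    lengths = [0] * len(p)
    heap = [(w, i, [i]) for i, w in enumerate(p)]  # (weight, tie-breaker, leaves)
    heapq.heapify(heap)
    next_id = len(p)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (a * (w1 + w2), next_id, s1 + s2))
        next_id += 1
    return lengths

def alpha_of(a):
    return 1.0 / (1.0 + math.log2(a))

def renyi_entropy(p, alpha):
    return math.log2(sum(x ** alpha for x in p)) / (1.0 - alpha)

def exp_average_length(p, lengths, a):
    # L_a(p, l) = log_a sum_i p_i a^{l_i}
    return math.log(sum(x * a ** l for x, l in zip(p, lengths)), a)

benford = [math.log10(i + 1) - math.log10(i) for i in range(1, 10)]
results = {}
for a in (0.6, 2.0):
    h = renyi_entropy(benford, alpha_of(a))
    lengths = exp_huffman_lengths(benford, a)
    cost = exp_average_length(benford, lengths, a)
    results[a] = (h, lengths, cost)
    print(a, round(h, 3), lengths, round(cost, 3), round(a ** cost, 3))
```

The last printed value is $a^{L}$, which for $a = 0.6$ corresponds to the probability of success mentioned in the text.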

Note that the optimal cost in both cases is quite close to entropy, indicating that better upper bounds might be possible. In looking for better bounds, recall first that, as with the exponential Huffman algorithm, (13) applies for both $a \in (0.5, 1)$ and $a > 1$. Improved bounds on the optimal solution for the $a > 1$ case are given in [26], but not in closed form, while closed-form bounds for a related objective are given in [28]. However, the proof for those bounds is incorrect in that it uses the assumption that we will always have an exponential-average-optimal $l_1$ equal to 1 if $p_1 \ge 0.4$. We shortly disprove this assumption for $a > 1$, showing the need for modified entropy bounds. Before this, we derive bounds based on the results of the prior section.

B. Better exponential-average bounds 

Because any exponential-average minimization can be transformed into an $R^d$ minimization problem, we can apply Corollary 1 for the exponential-average problem with $a > 1$ and a similar result where $a \in (0.5, 1)$: Given an exponential-average minimization problem with $p$ and $a$, if we define $\alpha = \alpha(a) = 1/(1 + \lg a)$ and

$$\hat{p}_i \triangleq \frac{p_i^\alpha}{\sum_{j=1}^{n} p_j^\alpha},$$

we have

$$R^{\lg a}(\hat{p}, l) = \log_a \sum_{i=1}^{n} p_i a^{l_i} - \log_a \left( \sum_{i=1}^{n} p_i^\alpha \right)^{1/\alpha} = L_a(p, l) - H_\alpha(p) \tag{15}$$

where $H_\alpha(p)$ is Rényi entropy, as in (12). This transformation, shown previously in [26], provides a reduction of exponential-average minimization to $d$th exponential redundancy. It also shows that improving the upper bound of 1 for the redundancy problem given $\hat{p}$ improves it for the exponential-average problem (the bounds shown in (13)) for $p$ and $a > 1$. The average-redundancy bound (10) is likewise a lower bound. This is a strict improvement in upper and lower bound as long as $\hat{p}_1$ is not a (negative integer) power of two, where the bounds are identical. Thus this yields Campbell's (unit-sized) bounds as well. Because $\hat{p}_1$ can be expressed as a function of $p_1$, $a$, and $H_\alpha(p)$, so can this bound:

Corollary 2: Let us denote the known upper redundancy bound for optimal average redundancy (Huffman) coding as $\bar{\omega}(p_1) < 1$, the known lower bound for average redundancy as $\underline{\omega}(p_1) \ge 0$, and the known upper redundancy bound for minimized maximum pointwise redundancy coding as $\bar{\omega}^*(p_1) < 1$. (These are given in [8], [6], and Theorem 1, respectively.) Then, for $a > 1$, we have

$$\underline{\omega}\left(p_1^\alpha 2^{(\alpha - 1) H_\alpha(p)}\right) \le L_a^{\mathrm{opt}}(p) - H_\alpha(p) \le \bar{\omega}^*\left(p_1^\alpha 2^{(\alpha - 1) H_\alpha(p)}\right).$$

Similarly, for $a \in (0.5, 1)$, we have

$$L_a^{\mathrm{opt}}(p) \le H_\alpha(p) + \bar{\omega}\left(p_1^\alpha 2^{(\alpha - 1) H_\alpha(p)}\right).$$

Proof: The $a > 1$ case is a direct result of Corollary 1 and equation (13). The $a < 1$ case is similar in nature: Lyapunov's inequality for moments yields $R^d(p, l) \le \bar{R}(p, l)$ for $d < 0$ for all $l$ and thus $R^d_{\mathrm{opt}}(p) \le \bar{R}_{\mathrm{opt}}(p) \le \bar{\omega}(p_1)$. Equation (15) turns this into the above inequality. ■
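The identity in (15) that underlies this proof can be checked numerically. The sketch below assumes that the $d$th exponential redundancy is $R^d(p, l) = \frac{1}{d} \lg \sum_i p_i^{1+d} 2^{d l_i}$, a definition restated here as an assumption because it is given outside this section; with that definition the matching order is $d = \lg a$. All function names are hypothetical.

```python
import math

def alpha_of(a):
    return 1.0 / (1.0 + math.log2(a))

def renyi_entropy(p, alpha):
    return math.log2(sum(x ** alpha for x in p)) / (1.0 - alpha)

def exp_average_length(p, lengths, a):
    return math.log(sum(x * a ** l for x, l in zip(p, lengths)), a)

def dth_exp_redundancy(p, lengths, d):
    # Assumed definition: (1/d) * lg sum_i p_i^(1+d) * 2^(d * l_i)
    return math.log2(sum(x ** (1 + d) * 2 ** (d * l) for x, l in zip(p, lengths))) / d

def transformed(p, alpha):
    # p_hat_i = p_i^alpha / sum_j p_j^alpha
    total = sum(x ** alpha for x in p)
    return [x ** alpha / total for x in p]

benford = [math.log10(i + 1) - math.log10(i) for i in range(1, 10)]
cases = [(0.6, [1, 2, 3, 4, 5, 6, 7, 8, 8]), (2.0, [2, 3, 3, 3, 3, 4, 4, 4, 4])]
gaps = []
for a, lengths in cases:
    alpha = alpha_of(a)
    lhs = dth_exp_redundancy(transformed(benford, alpha), lengths, math.log2(a))
    rhs = exp_average_length(benford, lengths, a) - renyi_entropy(benford, alpha)
    gaps.append(abs(lhs - rhs))
    print(a, lhs, rhs)
```

The identity holds for any length vector, not only optimal ones, so the two length vectors used here are just convenient test inputs.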
Recall the example of Benford's distribution in (14) for $a = 2$. In this case, the bounds improve from $[3.026\ldots, 4.026\ldots)$ to $[3.039\ldots, 3.910\ldots]$ using the $\bar{\omega}^*$ from Theorem 1 and $\underline{\omega}$ from [6] given as (10) here. For $a = 0.6$, the bounds on cost are reduced from $[2.259\ldots, 3.259\ldots)$ to $[2.259\ldots, 2.783\ldots]$ using $\bar{\omega}$ given as (10) in [3]:

$$\bar{R}_{\mathrm{opt}}(p) \le 2 - H(p_1) - p_1$$

where recall from (11) that $H(x) = -x \lg x - (1 - x) \lg(1 - x)$.
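The improved upper bound for $a = 0.6$ can be recomputed as follows. This sketch evaluates the transformed first probability $p_1^\alpha 2^{(\alpha - 1) H_\alpha(p)}$ and plugs it into the bound $2 - H(x) - x$ quoted above; the variable names are illustrative, and the check that the transformed probability is at least 0.5 reflects the assumption that this form of the bound is the one applicable in that range.

```python
import math

def binary_entropy(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

a = 0.6
alpha = 1.0 / (1.0 + math.log2(a))
benford = [math.log10(i + 1) - math.log10(i) for i in range(1, 10)]

power_sum = sum(x ** alpha for x in benford)
h_alpha = math.log2(power_sum) / (1.0 - alpha)

# First probability of the transformed source, written as a function of
# p_1, alpha, and H_alpha(p).
p_hat_1 = benford[0] ** alpha * 2 ** ((alpha - 1.0) * h_alpha)
assert p_hat_1 >= 0.5  # assumed range of validity for this form of the bound

upper = h_alpha + 2.0 - binary_entropy(p_hat_1) - p_hat_1
print(round(h_alpha, 3), round(p_hat_1, 3), round(upper, 3))
```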

Although the bounds derived from Huffman coding are close for $a \approx 1$ (the most common case), these are likely not tight bounds; we introduce another bound for $a < 1$ after deriving a certain condition in the next section.

C. Exponential-average shortest codeword length 

Fig. 3. Minimum $p_1$ sufficient for the existence of an optimal $l_1$ not exceeding 1.

Techniques for finding Huffman coding bounds do not always translate readily to exponential generalizations because Rényi entropy's very definition [33] involves a relaxation of a property used in finding bounds such as Gallager's entropy bounds [3], namely

$$H_1[t p_1, (1 - t) p_1, p_2, \ldots, p_n] = H_1[p_1, p_2, \ldots, p_n] + p_1 H_1(t, 1 - t)$$

for Shannon entropy $H_1$ and $t \in [0, 1]$. This fails to hold for Rényi entropy. The penalty function $L_a$ differs from the usual measure of expectation in an analogous fashion, and we cannot know the weight of a given subtree in the optimal code (merged item in the coding procedure) simply by knowing the sum probability of the items included. However, we can find improved bounds for the exponential problem when we know that $l_1 = 1$; the question then becomes when we can know this given only $a$ and $p_1$:

Theorem 3: There exists an optimal code with $l_1 = 1$ for $a$ and $p$ if either $a \le 0.5$ or both $a \in (0.5, 1]$ and $p_1 \ge 2a/(2a + 3)$. Conversely, given $a \in (0.5, 1]$ and $p_1 < 2a/(2a + 3)$, there exists a $p$ such that any code with $l_1 = 1$ is suboptimal. Likewise, given $a > 1$ and $p_1 < 1$, there exists a $p$ such that any code with $l_1 = 1$ is suboptimal.

Proof: Recall that the exponential Huffman algorithm combines the items with the smallest weights, $w'$ and $w''$, yielding a new item of weight $w = a w' + a w''$, and this process is repeated on the new set of weights, the tree thus being constructed up from the leaves to the root. This process makes it clear that, as mentioned, the finite unary code (with $l_1 = 1$) is optimal for all $a \le 0.5$. This leaves the two nontrivial cases.

1) $a \in (0.5, 1]$: This is a generalization of [4] and is only slightly more complex to prove. Consider the coding step at which item 1 gets combined with other items; we wish to prove that this is the last step. At the beginning of this step the (possibly merged) items left to combine are $\{1\}, S_2^k, S_3^k, \ldots, S_k^k$, where we use $S_j^k$ to denote both a (possibly merged) item of weight $w(S_j^k)$ and the set of (individual) items combined in item $S_j^k$. Since $\{1\}$ is combined in this step, all but one $S_j^k$ has at least weight $p_1$. Note too that all weights $w(S_j^k)$ must be less than or equal to the sums of probabilities $\sum_{i \in S_j^k} p_i$. Then

$$\frac{2a(k - 1)}{2a + 3} \le (k - 1) p_1 < p_1 + \sum_{j=2}^{k} \sum_{i \in S_j^k} p_i = \sum_{i=1}^{n} p_i = 1$$

which, since $a > 0.5$, means that $k < 5$. Thus, because $n < 4$ always has an optimal code with $l_1 = 1$, we can consider the steps in exponential Huffman coding at and after which four items remain, one of which is item $\{1\}$ and the others of which are $S_2^4$, $S_3^4$, and $S_4^4$. We show that, if $p_1 \ge 2a/(2a + 3)$, these items are combined as shown in Fig. 4.

Fig. 4. Tree in last steps of the exponential Huffman algorithm. (The node that merges $S_3^4$ and $S_4^4$ is labeled with weight $w(S_3^4 \cup S_4^4) = a\,w(S_3^4) + a\,w(S_4^4)$.)

We assume without loss of generality that weights $w(S_2^4)$, $w(S_3^4)$, and $w(S_4^4)$ are in descending order. From

$$w(S_2^4) + w(S_3^4) + w(S_4^4) \le \sum_{i=2}^{n} p_i \le \frac{3}{2a + 3},$$

$$w(S_2^4) \ge w(S_3^4), \quad \text{and} \quad w(S_2^4) \ge w(S_4^4)$$

it follows that $w(S_3^4) + w(S_4^4) \le 2/(2a + 3)$. Consider set $S_2^4$. If its cardinality is 1, then $w(S_2^4) \le p_1$, so the next step merges the least two weighted items $S_3^4$ and $S_4^4$. Since the merged item has weight at most $2a/(2a + 3)$, this item can then be combined with $S_2^4$, then $\{1\}$, so that $l_1 = 1$. If $S_2^4$ is a merged item, let us call the two items (sets) that merged to form it $S_2'$ and $S_2''$, indicated by the dashed nodes in Fig. 4. Because these were combined prior to this step,

$$w(S_2') + w(S_2'') \le w(S_3^4) + w(S_4^4)$$

so

$$w(S_2^4) \le a\,w(S_3^4) + a\,w(S_4^4) \le \frac{2a}{2a + 3}.$$

Thus $w(S_2^4)$, and by extension $w(S_3^4)$ and $w(S_4^4)$, are at most $p_1$. So $S_3^4$ and $S_4^4$ can be combined and this merged item can be combined with $S_2^4$, then $\{1\}$, again resulting in $l_1 = 1$.

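This can be shown to be tight by noting that, for any $\epsilon \in (0, (2a - 1)/(8a + 12))$,

$$p^{(\epsilon)} \triangleq \left( \frac{2a}{2a + 3} - 3\epsilon,\ \frac{1}{2a + 3} + \epsilon,\ \frac{1}{2a + 3} + \epsilon,\ \frac{1}{2a + 3} + \epsilon \right)$$

achieves optimality only with length vector $l = (2, 2, 2, 2)$. The result extends to smaller $p_1$.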
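The threshold $2a/(2a + 3)$ can be probed numerically. The sketch below, with $a = 0.8$ and $\epsilon = 0.01$ chosen arbitrarily for illustration, runs an assumed heap-based implementation of the exponential Huffman merge rule on a distribution just below the threshold and on a hypothetical distribution just above it.

```python
import heapq
import math

def exp_huffman_lengths(p, a):
    """Codeword lengths from the merge rule w = a * (w' + w'')."""
    lengths = [0] * len(p)
    heap = [(w, i, [i]) for i, w in enumerate(p)]  # (weight, tie-breaker, leaves)
    heapq.heapify(heap)
    next_id = len(p)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (a * (w1 + w2), next_id, s1 + s2))
        next_id += 1
    return lengths

def exp_average_length(p, lengths, a):
    return math.log(sum(x * a ** l for x, l in zip(p, lengths)), a)

a, eps = 0.8, 0.01
threshold = 2 * a / (2 * a + 3)
assert 0 < eps < (2 * a - 1) / (8 * a + 12)

# Distribution with p_1 slightly below the threshold.
p_below = [threshold - 3 * eps] + [1 / (2 * a + 3) + eps] * 3
lengths_below = exp_huffman_lengths(p_below, a)

# Hypothetical distribution with p_1 slightly above the threshold.
p_above = [0.36, 0.22, 0.21, 0.21]
lengths_above = exp_huffman_lengths(p_above, a)

print(lengths_below, lengths_above)
# Cost of forcing l_1 = 1 on the first distribution, for comparison.
print(exp_average_length(p_below, [2, 2, 2, 2], a),
      exp_average_length(p_below, [1, 2, 3, 3], a))
```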

2) $a > 1$: Given $a > 1$ and $p_1 < 1$, we wish to show that a probability distribution always exists such that there is no optimal code with $l_1 = 1$. We first show that, for the exponential penalties as for the traditional Huffman penalty, every optimal $l$ can be obtained via the (modified) Huffman procedure. That is, if multiple length vectors are optimal, each optimal length vector can be obtained by the Huffman procedure as long as ties are broken in a certain manner.

Clearly the optimal code is obtained for $n = 2$. Let $n'$ be the smallest $n$ for which there is an $l$ that is optimal but cannot be obtained via the algorithm. Since $l$ is optimal, consider the two smallest probabilities, $p_{n'}$ and $p_{n'-1}$. In this optimal code, two items having these probabilities (although not necessarily items $n' - 1$ and $n'$) must have the longest codewords and must have the same codeword lengths. Were the latter not the case, the codeword of the more probable item could be exchanged with one of a less probable item, resulting in a better code. Were the former not the case, the longest codeword length could be decreased by one without violating the Kraft inequality, resulting in a better code. Either way, the code would no longer be optimal. So clearly we can find two such smallest items with the longest codewords (by breaking any ties properly), which, without loss of generality, can be considered siblings. This means that the problem can be reduced to one of size $n' - 1$ via the exponential Huffman algorithm. But since all problems of size $n' - 1$ can be solved via the algorithm, this is a contradiction, and the Huffman algorithm can thus find any optimal code.

Note that this is not true for minimizing maximum pointwise redundancy, as the exchange argument no longer holds. This is why the sufficient condition of the previous section was not verified using Huffman-like methods.

Now we can show that there is always a distribution whose optimal codes have $l_1 > 1$ for any $p_1 \in (0.2, 1)$; $p_1 \le 0.2$ follows easily. Let

$$m = \left\lfloor \log_a \frac{4 p_1}{1 - p_1} \right\rfloor$$

and suppose $n = 1 + 2^{2+m}$ and $p_i = (1 - p_1)/(n - 1)$ for all $i \in \{2, 3, \ldots, n\}$. This distribution has an optimal code only with $l_1 = 2$ (or, if $m$ is equal to the logarithm from which it is derived, only with $l_1 = 2$ and with $l_1 = 3$), since, although item 1 need not be merged before the penultimate step, at this step its weight is strictly less than either of the two other remaining weights, which have values $w' = a^{1+m}(1 - p_1)/2$. Thus, knowing merely the values of $a > 1$ and $p_1 < 1$ is not sufficient to ensure that $l_1 = 1$. ■
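The construction for $a > 1$ can be tried on a concrete case. The sketch below uses $a = 2$ and $p_1 = 0.6$, both arbitrary choices, and takes $m$ to be the floor of $\log_a(4p_1/(1 - p_1))$; that rounding is an assumption, chosen so that item 1 is not forced into a merge while four equal subtrees remain. The Huffman routine is an illustrative heap-based implementation of the merge rule $w = a(w' + w'')$.

```python
import heapq
import math

def exp_huffman_lengths(p, a):
    """Codeword lengths from the merge rule w = a * (w' + w'')."""
    lengths = [0] * len(p)
    heap = [(w, i, [i]) for i, w in enumerate(p)]  # (weight, tie-breaker, leaves)
    heapq.heapify(heap)
    next_id = len(p)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (a * (w1 + w2), next_id, s1 + s2))
        next_id += 1
    return lengths

a, p1 = 2.0, 0.6
m = math.floor(math.log(4 * p1 / (1 - p1), a))   # assumed rounding: floor
n = 1 + 2 ** (2 + m)
p = [p1] + [(1 - p1) / (n - 1)] * (n - 1)

lengths = exp_huffman_lengths(p, a)
# Weight of each of the two subtrees left alongside item 1 at the penultimate step.
w_prime = a ** (1 + m) * (1 - p1) / 2
print(m, n, lengths[0], w_prime)
```

Although $p_1 = 0.6$ exceeds one half, the most probable item does not receive a one-bit codeword here, in line with the argument above.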

These relations are illustrated in Fig. 3, a plot of the minimum value of $p_1$ sufficient for the existence of an optimal code with $l_1$ not exceeding 1.

Similarly to minimum maximum pointwise redundancy, we can observe that, for $a \ge 1$ (that is, $a > 1$ and traditional Huffman coding), a necessary condition for $l_1 = 1$ is $p_1 \ge 1/3$. The sum of the last three combined weights is at least 1, and $p_1$ must be no less than the other two. However, for $a < 1$, there is no such necessary condition for $p_1$. Given $a \in (0.5, 1)$ and $p_1 \in (0, 1)$, consider the probability distribution consisting of one item with probability $p_1$ and $n - 1 = 2^{1+g}$ items with equal probability (so that $n = 1 + 2^{1+g}$), where

$$g = \max\left( \left\lceil \log_a \frac{2 a p_1}{1 - p_1} \right\rceil,\ \left\lceil \lg \frac{1 - 2 p_1}{p_1} \right\rceil,\ 0 \right)$$

and, by convention, we define the logarithm of negative numbers to be $-\infty$. Setting $p_i = (1 - p_1)/(n - 1)$ for all $i \in \{2, 3, \ldots, n\}$ results in a monotonic probability mass function in which $(1 - p_1) a^g / 2 < p_1$, which means that the generalized Huffman algorithm will have in its penultimate step three items: one of weight $p_1$ and two of weight $(1 - p_1) a^g / 2$; these two will be complete subtrees with each leaf at depth $g$. Since $(1 - p_1) a^g / 2 < p_1$, $l_1 = 1$. Again, this holds for any $a \in (0.5, 1)$ and $p_1 \in (0, 1)$, so no nontrivial necessary condition exists for $l_1 = 1$. This is also the case for $a \le 0.5$, since the unary code is optimal for any probability mass function.
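To see that a small $p_1$ can still receive a one-bit codeword when $a < 1$, the construction can be run for, say, $a = 0.7$ and $p_1 = 0.1$ (arbitrary choices). In this sketch the depth $g$ is computed with ceilings on both logarithms, which is an assumption about the rounding in the formula, and a nonpositive argument is treated as contributing nothing to the maximum. The Huffman routine is an illustrative heap-based implementation.

```python
import heapq
import math

def exp_huffman_lengths(p, a):
    """Codeword lengths from the merge rule w = a * (w' + w'')."""
    lengths = [0] * len(p)
    heap = [(w, i, [i]) for i, w in enumerate(p)]  # (weight, tie-breaker, leaves)
    heapq.heapify(heap)
    next_id = len(p)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (a * (w1 + w2), next_id, s1 + s2))
        next_id += 1
    return lengths

def ceil_log(x, base):
    # Logarithm of a nonpositive number is taken as -infinity, so it never
    # wins the max below; returning 0 has the same effect.
    return math.ceil(math.log(x, base)) if x > 0 else 0

a, p1 = 0.7, 0.1
g = max(ceil_log(2 * a * p1 / (1 - p1), a),      # assumed rounding: ceiling
        ceil_log((1 - 2 * p1) / p1, 2),
        0)
n = 1 + 2 ** (1 + g)
p = [p1] + [(1 - p1) / (n - 1)] * (n - 1)

lengths = exp_huffman_lengths(p, a)
subtree_weight = (1 - p1) * a ** g / 2
print(g, n, lengths[0], set(lengths[1:]), subtree_weight)
```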

D. Exponential-average bounds for $a \in (0.5, 1)$, $p_1 \ge 2a/(2a + 3)$

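Entropy bounds derived from Theorem 3, although rather complicated, are, in a certain sense, tight:

Corollary 3: For $l_1 = 1$ (and thus for all $p_1 \ge 2a/(2a + 3)$) and $a \in (0.5, 1)$, the following holds, where $\alpha = \alpha(a) = 1/(1 + \lg a)$:

$$\sum_{i=1}^{n} p_i a^{l_i} > a^2 \left[ a^{\alpha H_\alpha(p)} - p_1^\alpha \right]^{1/\alpha} + a p_1$$

or, equivalently,

$$L_a^{\mathrm{opt}}(p) < 1 + \log_a \left( a \left[ a^{\alpha H_\alpha(p)} - p_1^\alpha \right]^{1/\alpha} + p_1 \right)$$

and

$$\sum_{i=1}^{n} p_i a^{l_i} \le a \left[ a^{\alpha H_\alpha(p)} - p_1^\alpha \right]^{1/\alpha} + a p_1$$

or, equivalently,

$$L_a^{\mathrm{opt}}(p) \ge 1 + \log_a \left( \left[ a^{\alpha H_\alpha(p)} - p_1^\alpha \right]^{1/\alpha} + p_1 \right).$$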

Note that this upper bound is tight for $p_1 \ge 0.5$, in the sense that, given values for $a$ and $p_1$, we can find a $p$ that brings the inequality arbitrarily close to equality. Probability distribution $p = (p_1, 1 - p_1 - \epsilon, \epsilon)$ does this for small $\epsilon$, while the lower bound is tight (in the same sense) over its full range, since $p = (p_1, (1 - p_1)/4, (1 - p_1)/4, (1 - p_1)/4, (1 - p_1)/4)$ achieves it (with a zero-redundancy subtree of the weights excluding $p_1$).

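Proof: We apply the simple unit-sized coding bounds (13) for the subtree that includes all items but item $\{1\}$. Let $B = \{2, 3, \ldots, n\}$ with $p_i^B = P[i \mid i \in B] = p_i/(1 - p_1)$ and with Rényi $\alpha$-entropy

$$H_\alpha(p^B) = \frac{1}{1 - \alpha} \lg \sum_{i=2}^{n} \left( \frac{p_i}{1 - p_1} \right)^\alpha.$$

$H_\alpha(p^B)$ is related to the entropy of the original source $p$ by

$$2^{(1 - \alpha) H_\alpha(p)} = (1 - p_1)^\alpha\, 2^{(1 - \alpha) H_\alpha(p^B)} + p_1^\alpha$$

or, equivalently, since $2^{1 - \alpha} = a^\alpha$,

$$a^{H_\alpha(p^B)} = \frac{1}{1 - p_1} \left[ a^{\alpha H_\alpha(p)} - p_1^\alpha \right]^{1/\alpha}. \tag{16}$$

Applying (13) to subtree $B$, we have

$$a^{H_\alpha(p^B)} \ge \frac{1}{(1 - p_1) a} \sum_{i=2}^{n} p_i a^{l_i} > a^{H_\alpha(p^B) + 1}.$$

The bounds for $\sum_i p_i a^{l_i}$ are obtained by substituting (16), multiplying both sides by $(1 - p_1) a$, and adding the contribution of item $\{1\}$, $a p_1$. ■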
A Benford distribution (14) for $a = 0.6$ yields $H_{\alpha(a)}(p) \approx 2.260$. Since $p_1 > 2a/(2a + 3)$, $l_1$ is 1 and the probability of success is between 0.250 and 0.298; that is, $L_a^{\mathrm{opt}} \in [2.372\ldots, 2.707\ldots)$. Recall that the bounds found using (15) were $P[\text{success}] \in (0.241, 0.316)$ and $L_a^{\mathrm{opt}} \in [2.259\ldots, 2.783\ldots]$, an improvement on the unit-sized bounds, but not as good as those of Corollary 3. The optimal code $l = (1, 2, 3, 4, 5, 6, 7, 8, 8)$ yields a probability of success of 0.296 ($L_a^{\mathrm{opt}} = 2.382\ldots$).
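The Corollary 3 interval for this example can be recomputed directly. The sketch below uses the identity $a^{\alpha H_\alpha(p)} = \sum_i p_i^\alpha$ (which follows from $2^{1-\alpha} = a^\alpha$) to avoid computing the entropy explicitly; variable names are illustrative, and the optimal lengths are copied from the text rather than recomputed.

```python
import math

a = 0.6
alpha = 1.0 / (1.0 + math.log2(a))
benford = [math.log10(i + 1) - math.log10(i) for i in range(1, 10)]
p1 = benford[0]
assert p1 >= 2 * a / (2 * a + 3)   # condition under which l_1 = 1 is guaranteed

# a^(alpha * H_alpha(p)) equals the power sum of the probabilities.
power_sum = sum(x ** alpha for x in benford)
inner = (power_sum - p1 ** alpha) ** (1.0 / alpha)

lower = 1 + math.log(inner + p1, a)       # L_opt >= lower
upper = 1 + math.log(a * inner + p1, a)   # L_opt <  upper

# Optimal lengths quoted in the text for a = 0.6.
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 8]
cost = math.log(sum(x * a ** l for x, l in zip(benford, lengths)), a)

print(round(lower, 3), round(cost, 3), round(upper, 3))
print(round(a ** upper, 3), round(a ** cost, 3), round(a ** lower, 3))
```

The second line of output gives the corresponding probabilities of success, which run in the opposite order because $a < 1$.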

Note that these arguments fail for $a > 1$ due to the lack of sufficient conditions for $l_1 = 1$. For $a < 1$, other cases likely have improved bounds that could be found by bounding $l_1$ (as with the use of lengths in [37] to come up with general bounds [7], [8]), but new bounds would each cover a more limited range of $p_1$ and be more complicated to state and to prove.